It is unlikely that data involves only a single data frame
You must join tables together
When you want to join two tables, you must find keys:
Primary key: set of variable(s) that uniquely identifies each observation.
Foreign key: set of variable(s) that corresponds to primary key in another table.
Example: Financials and Stock prices
Stock price data: contains daily stock prices
key: symbol and date
library(tidyquant)symbols <-c("AAPL","MSFT", "TSLA", "GME")stock_prices <-tq_get(symbols, from ="2023-01-01", to ="2023-12-31")stock_prices |>slice_head(n =1, by = symbol)
# A tibble: 4 × 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAPL 2023-01-03 130. 131. 124. 125. 112117500 124.
2 MSFT 2023-01-03 243. 246. 237. 240. 25740000 235.
3 TSLA 2023-01-03 118. 119. 105. 108. 231402800 108.
4 GME 2023-01-03 18.6 19.3 17.1 17.2 5135200 17.2
anti_join(): Keep all rows that that is not found in the right frame.
stock_prices |>anti_join(sp500_index, by =join_by(symbol)) |>slice_head(n =2, by = symbol)
# A tibble: 2 × 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 GME 2023-01-03 18.6 19.3 17.1 17.2 5135200 17.2
2 GME 2023-01-04 17.2 17.9 16.9 17.3 3939300 17.3
Exercise
Type the code below and see what’s happening:
stock_prices |>semi_join(sp500_index, by =join_by(symbol)) |>slice_head(n =2, by = symbol)stock_prices |>anti_join(sp500_index, by =join_by(symbol)) |>slice_head(n =2, by = symbol)
Advanced joins 1
1:M (One to Many) matching
When foreign key is NOT uniquely identifying each observations:
Inner join with duplicate key
A copy of row is generated (i.e. broadcasted) to guarantee matching.
Important
What would happen if you perform left / right / inner / outer (full) join when you have 1:M relationship?
Think about in terms of number of rows you will eventually have.
Advanced joins 2
Non-equi joins:
when join includes inequality (e.g. key1 >= key2)
Cross joins : match every pair, Cartesian product (M:M matching)
Inequality joins: uses <, >, >=, <= instead of ==
Rolling joins: inequality with only closest matching (time series)
Overlap joins: join when within a range
Non-equi join example
Rolling joins
Suppose you have data in two different time frequency.
Stock trading data in minute-level frequency, and risk-free rate in daily (end-of-day) frequency.
How can you match the most recent risk-free rate to each stock trading data?
Non-equi join example 2
Overlap joins
Suppose a specific company was merged and its ticker symbol changed from “AABB” to “AAC”, then to “ABC”.
Some other company started to use “AABB” as its ticker.
In this case, when joining, you want to match ticker AND with specific date that is between dates when ticker symbol was active for the specific company.
SQL
SQL
Structured Query Language
SQL is a query language for storing and processing information in relational database.
It is standardized and widely used database language.
SQL basic verbs
Like dplyr, SQL has its own (but archaic) syntax:
SELECT: choose variables for the table
GROUP BY: group rows of data by grouper variable
ORDER BY: arrange the rows by the variable
WHERE: filter rows with given condition
HAVING: Similar to where but used with group by to filter data
JOIN: join tables, defaults INNER
LIMIT: show the head of table
WITH: temporary table for subquery
SQL and dplyr
SELECT, WHERE
WHERE is filtering function:
SELECT symbol, date, adjustedFROM stock_pricesWHERE symbol ='AAPL';
JOIN is INNER JOIN by default. Use LEFT JOIN for left join, etc.
SELECT sp.symbol, sp.date, sp.adjusted, si.sectorFROM stock_prices sp -- alias spJOIN sp500_index si ON sp.symbol = si.symbol;
Equivalent to
stock_prices |>inner_join(sp500_index, by =join_by(symbol)) |>select(symbol, date, adjusted, sector)
Complex Query
Subqueries in SQL are written with “WITH”
WITH high_weight_symbols AS (SELECT symbolFROM sp500_indexWHERE weight >0.03) SELECT symbol, AVG(adjusted) AS avg_price, SUM(volume) AS total_volumeFROM stock_pricesWHERE symbol IN (SELECT symbol FROM high_weight_symbols)GROUPBY symbolHAVINGAVG(adjusted) >50;
SQL is (somehow) easier to read, but it does NOT guarantee that the execution is in the order of writing.
The query usually starts with SELECT, but it is executed later.
Confusingly, you have to write things in pre-defined order (SELECT then FROM then WHERE etc…)
Also confusingly, some verbs have different nuances but does similar job (and vice versa)
WHERE and HAVING
GROUP BY and PARTITION BY
etc.
Notes on SQL
Source: Julia Evans, “Become a Select STAR”
Database in R
Prep
Connections pane in IDEs is helpful to understand and browse database. To enable, first install:
install.packages("duckdb") # Our in-memory DB engineinstall.packages("remotes") # Need remotes package for github pacakge installationremotes::install_github("rstudio/connections") # Install package from github
Open and Close connections
You can make a databse connection to anywhere:
Local memory
Local disk
Remote databse
library(connections)# by default, it is in-memory connectioncon <-connection_open(db_engine_name)
When finished, close connection with:
connection_close(con)
dbplyr: dplyr to SQL Databases
Tip
You can still use dplyr syntax to perform query on SQL databases, by using dplyr verbs on dbplyr object.
Package dbplyr converts dplyr to SQL code.
dbplyr is called backend of dplyr
Database Simulation
Creating in-memory database
Here I use duckdb engine with dbplyr.
Copy dataframe (in R memory) to database (in memory).
library(connections)library(dbplyr)# establish database connection (in-memory)con <-connection_open(duckdb::duckdb()) # Copying data to database connectioncopy_to(con, stock_prices, name ="stock_price_table")copy_to(con, sp500_index, name ="sp500_index_table")
Example 1
Running SQL query on the database
tbl(con, sql("SELECT symbol, date, adjusted FROM stock_price_table WHERE symbol = 'AAPL' LIMIT 5"))
Write an SQL query that selects symbol, date, adjusted, and volume from the stock_prices table where volume is less than 50,000,000 and adjusted is greater than 45.
Translate the SQL query into dplyr code using filter() and select()
Construct an SQL query that groups the stock_prices table by symbol and calculates the maximum and minimum values of adjusted (naming them max_price and min_price) as well as the count of trading days (as num_days).
Rewrite above query into dplyr, and confirm the results.
Example 2
Complex query example:
tbl(con, sql("WITH high_weight_symbols AS ( SELECT symbol FROM sp500_index_table WHERE weight > 0.03) SELECT symbol, AVG(adjusted) AS avg_price, SUM(volume) AS total_volumeFROM stock_price_tableWHERE symbol IN (SELECT symbol FROM high_weight_symbols)GROUP BY symbolHAVING AVG(adjusted) > 50"))
A subquery named as active_stocks to extract symbols from sp500_index where weight is above 0.025.
Query stock_prices to calculate the average adjusted price (as avg_price) and total volume (as total_volume) and number of days (as trading_days) for these symbols,
grouping by symbol and show only those symbols with at least 30 trading days (HAVING trading_days >= 30)
Convert into dplyr.
Database Simulation 2
Creating a local (disk) database
Create on-disk database, with explicit directory address:
# Establish a disk-based connection con_disk <-connection_open(duckdb::duckdb(), dbdir ="disk_db.duckdb") # Copying data to database connectioncopy_to(con_disk, stock_prices, name ="stock_price_table_disk", temporary =FALSE)copy_to(con_disk, sp500_index, name ="sp500_index_table_disk", temporary =FALSE)
# Assign an R object name to use dbplyrstock_price_db_disk <-tbl(con_disk, "stock_price_table") sp500_index_db_disk <-tbl(con_disk, "sp500_index_table")
The remaining workflow will be the same as shown in in-memory database.
Caveats: On-Disk Databases
On-Disk database can handle datasets larger than the system’s memory capacity.
However, overall performance still depends on the system’s memory and disk performance.
Processing Considerations:
Memory Usage: Complex operations (e.g., joins, sorts) may still be memory intensive.
Disk I/O: Query performance can be influenced by disk speed.
Caveats: dbplyr and SQL
Complex dplyr syntax may not be fully translate-able to SQL and vice versa. If you use database often, it is worthwhile to learn.